From Recurrence to Attention: Addressing the Limitations of Sequence Modeling

Traditional sequence modeling relied heavily on recurrent neural networks (RNNs) and their gated variants (LSTMs, GRUs). Although these architectures were a breakthrough for early sequence-to-sequence translation tasks, they scale poorly when handling long-range dependencies. The development of the attention mechanism supplied the key conceptual advance needed to overcome these limitations, and it made modern high-performance natural language processing (NLP) systems possible.

1. The Long-Range Dependency Problem

In an RNN, the connection between token $t_i$ and token $t_j$ must pass sequentially through every intermediate step. During backpropagation, the gradient signal is therefore multiplied repeatedly by the recurrent weight matrix, which makes it shrink rapidly (the vanishing gradient problem). As a result, it becomes nearly impossible to carry useful information or an error signal across long distances in the sequence. The path complexity (the dependency path length) is $O(N)$.
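The repeated multiplication described above can be illustrated with a minimal sketch. It assumes a scalar RNN of the form $h_t = \tanh(w \cdot h_{t-1} + x_t)$; the function name `gradient_through_time`, the weight `0.9`, and the constant inputs are hypothetical choices made only for illustration, not part of the original lesson.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_N / d h_0 for the toy scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule for one step: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# Same weight and inputs, two different sequence lengths (illustrative values).
grad_short = gradient_through_time(0.9, [0.1] * 5)
grad_long = gradient_through_time(0.9, [0.1] * 100)
print(grad_short, grad_long)
```

Because every per-step factor has magnitude below 1 in this setup, the gradient over 100 steps is many orders of magnitude smaller than over 5 steps. Real RNNs use weight matrices rather than a scalar, but the same product structure applies.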

2. The Fixed-Size Context Bottleneck

Before attention, the standard encoder-decoder architecture required the entire meaning of the source sequence, however long, to be compressed into a single fixed-size vector (the context vector, $C$). This bottleneck limits the model's capacity to retain all the necessary information, especially for long or complex inputs, so important information is lost by the time decoding begins.
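To make the bottleneck concrete, the following toy sketch folds a sequence into one vector whose size never changes. The encoder here is a deliberately simplified stand-in, not a trained model: `encode`, `HIDDEN_SIZE`, and the mixing weights are all hypothetical values chosen for illustration.

```python
import math

HIDDEN_SIZE = 4  # hypothetical size of the context vector C

def encode(tokens, hidden_size=HIDDEN_SIZE):
    """Toy recurrent encoder: fold a sequence of floats into one fixed-size vector."""
    h = [0.0] * hidden_size
    for x in tokens:
        # Each unit mixes its previous state with the new input; the weights
        # 0.5 and (k + 1) / hidden_size are arbitrary illustrative values.
        h = [math.tanh(0.5 * h[k] + (k + 1) / hidden_size * x)
             for k in range(hidden_size)]
    return h  # the final hidden state plays the role of C

c_short = encode([0.2, -0.1, 0.4])        # 3 input tokens
c_long = encode([0.2, -0.1, 0.4] * 100)   # 300 input tokens
print(len(c_short), len(c_long))  # → 4 4
```

A 3-token input and a 300-token input both end up in the same four numbers, which is the compression the decoder has to work from.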

Concept Visualization

RNN Context Bottleneck

A visualization illustrating the traditional RNN Encoder-Decoder structure where the sequence is compressed into a single, fixed-size vector before being passed to the decoder. This point of compression often results in the loss of fine-grained information required for accurate long-sequence translation.

Diagram of an RNN Encoder-Decoder showing the context vector bottleneck

Question 1

Why is the dependency path length in a standard RNN considered a major limitation for long sequences?

Path complexity is $O(1)$.

Path complexity is $O(N^2)$.

Path complexity is $O(N)$, causing vanishing gradients.

It prevents the use of LSTMs.

Question 2

In pre-Attention Seq2Seq models, what component represents the 'information bottleneck'?

The softmax layer.

The recurrent cell (e.g., GRU).

The fixed-size context vector derived from the encoder's final hidden state.

The input embedding layer.

Challenge: Conceptualizing Attention's Advantage

Comparing Structural Complexity

Consider a sequence of length $N$. We want to establish a dependency between token $X_i$ and token $Y_j$.

Contrast the dependency path length required by:

Traditional Recurrence (e.g., LSTM)
Attention Mechanism (Query-Key comparison)

Step 1

How does Attention fundamentally reduce the structural complexity of establishing distant dependencies?

Solution:
Attention creates a direct, non-sequential connection between any output token $Y_j$ and any input token $X_i$ by calculating a weight based on their vector similarity ($Q_j K_i^T$). The dependency path length is effectively $O(1)$ (a direct look-up), removing the constraint of linear path traversal imposed by recurrence ($O(N)$).
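The direct connection described in the solution can be sketched in a few lines. This is a minimal illustration that assumes plain (unscaled) dot-product scores followed by a softmax; the helper `attention_weights` and the toy `keys` and `query` vectors are hypothetical and stand in for learned representations.

```python
import math

def attention_weights(query, keys):
    """Softmax over the dot products of one query with every key."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy data: the relevant input token sits at position 0,
# far from the end of a 10-token sequence.
keys = [[1.0, 0.0]] + [[0.0, 1.0]] * 9
query = [4.0, 0.0]

weights = attention_weights(query, keys)
best = max(range(len(weights)), key=weights.__getitem__)
print(best, weights)
```

The query scores every input position in a single step, so the most relevant token receives the largest weight no matter how far away it is. No information has to travel through the intermediate positions, which is why the path length is $O(1)$, although one comparison per input position is still computed.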